Forced Derivation Tree based Model Training to Statistical Machine Translation
Abstract
A forced derivation tree (FDT) of a sentence pair {f, e} denotes a derivation tree that can translate f into its accurate target translation e. In this paper, we present an approach that leverages structured knowledge contained in FDTs to train component models for statistical machine translation (SMT) systems. We first describe how to generate different FDTs for each sentence pair in the training corpus, and then present how to infer the optimal FDTs based on their derivation and alignment qualities. As the first step in this line of research, we verify the effectiveness of our approach in a BTG-based phrasal system, and propose four FDT-based component models. Experiments are carried out on large-scale English-to-Japanese and Chinese-to-English translation tasks, and significant improvements are reported on both translation quality and alignment quality.
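To make the definition of an FDT concrete, the following is a minimal sketch of forced derivation under a bracketing transduction grammar (BTG), assuming a toy phrase table. It is not the paper's algorithm: the abstract describes generating several FDTs per sentence pair and selecting the optimal one, whereas this sketch returns only the first derivation it finds. The function name `forced_derivation`, the tuple-based tree format, and the phrase table layout are all hypothetical choices made for illustration.

```python
from functools import lru_cache


def forced_derivation(src, tgt, phrase_table):
    """Return one BTG derivation tree that translates src into exactly tgt.

    phrase_table (hypothetical layout) maps a tuple of source words to a
    set of tuples of target words. Returns None if no derivation exists.
    """
    src, tgt = tuple(src), tuple(tgt)

    @lru_cache(maxsize=None)
    def derive(i, j, k, l):
        # Try to translate the source span src[i:j] into the target span tgt[k:l].
        s, t = src[i:j], tgt[k:l]
        if t in phrase_table.get(s, ()):
            return ("phrase", " ".join(s), " ".join(t))
        # Otherwise split both spans and combine two sub-derivations.
        for m in range(i + 1, j):
            for n in range(k + 1, l):
                # Straight rule: source order is preserved on the target side.
                left = derive(i, m, k, n)
                if left is not None:
                    right = derive(m, j, n, l)
                    if right is not None:
                        return ("straight", left, right)
                # Inverted rule: the two halves swap order on the target side.
                left = derive(i, m, n, l)
                if left is not None:
                    right = derive(m, j, k, n)
                    if right is not None:
                        return ("inverted", left, right)
        return None

    return derive(0, len(src), 0, len(tgt))


# Toy usage: the target reverses the source order, so an inverted rule is needed.
toy_table = {("a",): {("A",)}, ("b",): {("B",)}}
tree = forced_derivation(["a", "b"], ["B", "A"], toy_table)
print(tree)
```

A sentence pair for which the function returns `None` has no forced derivation under the given phrase table, so in a setting like this one it would contribute no FDT to training.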
Similar resources
Description of KYOTO EBMT System in PatentMT at NTCIR-10
This paper describes the “KYOTO” EBMT system that participated in PatentMT at NTCIR-10. When translating very different language pairs such as Japanese-English, it is very important to handle sentences in tree structures to overcome the difference. Many recent studies incorporate tree structures in some parts of the translation process, but not all the way from model training (parallel sentence alignment) to...
Rule Markov Models for Fast Tree-to-String Translation
Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use mini...
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of space leads to some s...
Akamon: An Open Source Toolkit for Tree/Forest-Based Statistical Machine Translation
We describe Akamon, an open source toolkit for tree and forest-based statistical machine translation (Liu et al., 2006; Mi et al., 2008; Mi and Huang, 2008). Akamon implements all of the algorithms required for tree/forest-to-string decoding using tree-to-string translation rules: multiple-thread forest-based decoding, n-gram language model integration, beam- and cube-pruning, k-best hypotheses ex...
N-Gram-Based Statistical Machine Translation versus Syntax Augmented Machine Translation: Comparison and System Combination
In this paper we compare and contrast two approaches to Machine Translation (MT): the CMU-UKA Syntax Augmented Machine Translation system (SAMT) and UPC-TALP N-gram-based Statistical Machine Translation (SMT). SAMT is a hierarchical syntax-driven translation system underlain by a phrase-based model and a target part parse tree. In N-gram-based SMT, the translation process is based on bilingual ...